Training the model on all available features, with HIPPIE as the training set, was found not to produce useful weightings. Re-evaluating the project's approach and applying Bayesian principles produced a usable set of weights, but the resulting classifications were still of little use. A more principled way to perform the classification is to apply the classification algorithm only to the indirect data, and to integrate the direct data by hand, updating on it as evidence according to estimated error rates.
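The "updating on it as evidence" step can be sketched as a simple Bayes-rule update. This is not the project's actual code; the error rates below (`fpr`, `fnr`) are hypothetical placeholders for whatever estimates are obtained for a given direct-evidence source:

```python
def bayes_update(prior, observed_positive, fpr=0.05, fnr=0.2):
    """Return P(interaction | observation) via Bayes' rule.

    prior: P(interaction) before seeing this piece of direct evidence
    fpr:   assumed false positive rate, P(observed + | no interaction)
    fnr:   assumed false negative rate, P(observed - | interaction)
    """
    if observed_positive:
        likelihood_true = 1.0 - fnr   # P(observed + | interaction)
        likelihood_false = fpr        # P(observed + | no interaction)
    else:
        likelihood_true = fnr         # P(observed - | interaction)
        likelihood_false = 1.0 - fpr  # P(observed - | no interaction)
    numerator = likelihood_true * prior
    return numerator / (numerator + likelihood_false * (1.0 - prior))
```

A positive direct observation raises the posterior above the prior, a negative one lowers it, and repeated applications let several independent direct-evidence sources be folded in one at a time.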
This notebook creates the training set for a classifier that will be trained only on indirect data. All available primary interaction databases are integrated to produce a list of high-confidence positive interactions; at the same time, a list of negative interactions is generated.
The databases combined to form this list of positive interactions will be:
In [3]:
cd ../../iRefIndex
In [1]:
import pickle
In [5]:
import sys
sys.path.append("opencast-bio/")
In [6]:
import ocbio.irefindex
In [7]:
f = open("human.iRefIndex.Entrez.1ofk.pickle", "rb")
irefin = pickle.load(f)
f.close()
In [8]:
import csv
In [12]:
f = open("human.iRefIndex.positive.pairs.txt", "w")
c = csv.writer(f, delimiter="\t")
# write each positive interaction pair as a tab-separated row
for pair in irefin.featuredict:
    c.writerow(list(pair))
f.close()
In [14]:
print "Number of positive interactions: {0}".format(len(irefin.featuredict))
In [15]:
print "Number of negative interactions required: {0}".format(600*len(irefin.featuredict))
This is not a feasible number of interactions to write to file. It is also far more than is required to train a classifier. We therefore only need to write enough negative interactions to build training sets of the sizes used during training; one million negative examples will be more than enough.
In [16]:
import itertools
In [25]:
from random import shuffle

# collect the unique protein identifiers appearing in any positive pair
ids = list(set(itertools.chain.from_iterable(irefin.featuredict.keys())))
shuffle(ids)
In [26]:
f = open("human.iRefIndex.negative.pairs.txt", "w")
c = csv.writer(f, delimiter="\t")
# write pairs not found in the positive set, stopping after one million candidates
for i, pair in enumerate(itertools.combinations(ids, 2)):
    if frozenset(pair) not in irefin.featuredict:
        c.writerow(pair)
    if i > 1000000:
        break
f.close()
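One caveat with enumerating `itertools.combinations` of a shuffled list is that the first million combinations only involve identifiers near the front of the shuffled order, so the negative set is not a uniform sample over all possible pairs. A hedged alternative sketch (assuming uniform random negatives are acceptable for training) draws pairs at random instead:

```python
import random

def sample_negative_pairs(ids, positives, n, seed=42):
    """Draw n distinct unordered pairs uniformly at random,
    excluding any pair present in the positive set."""
    rng = random.Random(seed)
    negatives = set()
    while len(negatives) < n:
        pair = frozenset(rng.sample(ids, 2))  # two distinct ids
        if pair not in positives:
            negatives.add(pair)
    return negatives
```

This touches every identifier with equal probability, at the cost of rejection sampling; with millions of possible pairs and only ~10^6 negatives needed, the rejection rate is negligible.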
In [28]:
!head human.iRefIndex.negative.pairs.txt